Clustering is an unsupervised learning technique that groups similar objects or data points based on their similarity. It is used in fields like text mining, image analysis, market segmentation, and bioinformatics.
Since clustering lacks predefined labels, validating the quality of the resulting clusters is essential.
Clustering validation, which evaluates the goodness of clustering results, has long been recognized as one of the vital issues for the success of clustering applications. It is the process of evaluating the quality and effectiveness of clustering results in unsupervised learning: it helps determine how well the data has been grouped into clusters and whether the clustering makes sense for the problem.
Clustering validation measures can be categorized into three main types: internal clustering validation, external clustering validation, and relative clustering validation. The main difference is whether or not external information is used for clustering validation.
1) Internal Clustering Validation
It involves evaluating the clustering quality using only the data and the cluster assignments, with no ground-truth labels needed. Internal validation measures rely solely on information in the data and evaluate the goodness of a clustering structure without reference to external information. Hence they can be used to choose the best clustering algorithm as well as the optimal cluster number without any additional information. In practice, external information such as class labels is often not available in many application scenarios; in such situations, internal validation measures are the only option for cluster validation.
As the goal of clustering is to make objects within the same cluster similar and objects in different clusters distinct, internal validation measures are often based on the following two criteria.
I. Compactness: Intra-Cluster Similarity
It measures how closely related the objects in a cluster are. One group of measures evaluates cluster compactness based on variance: lower variance indicates better compactness. Numerous other measures estimate cluster compactness based on distance, such as the maximum or average pairwise distance, and the maximum or average center-based distance.
One method to measure compactness is to determine the centroids of the different clusters and report the sum of squared (SSQ) distances of the points to their centroids as the corresponding objective function. Smaller values of this measure indicate better cluster quality. This measure is obviously more suited to distance-based algorithms, such as k-means, than to a density-based method such as DBSCAN. Another problem with SSQ is that the absolute distances provide no meaningful information to the user about the quality of the underlying clusters.
Other than SSQ, we can use the average distance between all points and the cluster center, the diameter of a cluster (the maximum distance between any two points within a cluster), the mean pairwise distance, and the variance within clusters.
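These compactness measures are simple enough to sketch directly. Below is a minimal pure-Python illustration; the helper names (`ssq`, `diameter`, `centroid`) and the tiny two-cluster dataset are mine, chosen only to make the arithmetic easy to check by hand:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def ssq(clusters):
    """Sum of squared distances of every point to its own cluster centroid."""
    total = 0.0
    for pts in clusters:
        c = centroid(pts)
        total += sum(dist(p, c) ** 2 for p in pts)
    return total

def diameter(points):
    """Maximum pairwise distance within one cluster."""
    return max(dist(p, q) for p in points for q in points)

clusters = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
print(ssq(clusters))          # 4.0  (each point is 1 away from its centroid)
print(diameter(clusters[0]))  # 2.0
```

As the text notes, these raw values are algorithm- and scale-dependent: SSQ = 4.0 means little on its own, and is mainly useful for comparing clusterings of the same data.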
II. Separation: Inter-Cluster Dissimilarity
It measures how distinct or well-separated a cluster is from other clusters. For example, the pairwise distances between cluster centers or the pairwise minimum distances between objects in different clusters are widely used as measures of separation. Measures based on density are also used in some indices.
Example Metrics for Internal Validation:
1.1 Silhouette Score
Proposed by the Belgian statistician Peter Rousseeuw in 1987, the Silhouette Score is computed for each data point individually. Each point gets its own score based on:
How close it is to other points in its own cluster (intra-cluster distance)
How far it is from points in the nearest neighboring cluster (nearest-cluster distance)
We can then compute the average silhouette score over all the objects (data points).
The silhouette ranges from −1 to +1, where a high value indicates that the object is well-matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. A clustering with an average silhouette width of over 0.7 is considered to be "strong", a value over 0.5 "reasonable" and over 0.25 "weak".
Assume the dataset has been clustered into k clusters, and consider a data point i in cluster C_I.
The average intra-cluster distance a(i) is calculated with the formula:
a(i) = (1 / (|C_I| − 1)) · Σ_{j ∈ C_I, j ≠ i} d(i, j)
The nearest-cluster distance b(i) is the smallest average distance from i to all points of any other cluster:
b(i) = min_{J ≠ I} (1 / |C_J|) · Σ_{j ∈ C_J} d(i, j)
We now define the silhouette (value) of one data point i as follows:
s(i) = (b(i) − a(i)) / max{a(i), b(i)}   if |C_I| > 1, and s(i) = 0 if |C_I| = 1
NOTE: The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.
b(i) >> a(i) implies s(i) is close to 1, which means that i is far better matched to its own cluster than to even its nearest neighbouring cluster. Thus an s(i) close to 1 means that the data point is appropriately clustered.
b(i) << a(i) implies s(i) is close to −1, which means that i is badly matched to its own cluster. Thus an s(i) close to −1 means that i would be more appropriately clustered in its nearest neighbouring cluster.
An s(i) near zero means that the data point i is on the border of two natural clusters. This happens when b(i) and a(i) have similar values, implying that the point is not clearly assigned to any one cluster; it could belong equally well to either its current cluster or a neighboring one. This may suggest ambiguous clustering or overlap between clusters at that location.
Thus the mean s(i) over all data points of the entire dataset is a measure of how appropriately the data have been clustered. If there are too many or too few clusters, as may occur when a poor choice of k is used in the clustering algorithm, some of the clusters will typically display much narrower silhouettes than the rest (i.e., most points in those clusters have silhouette scores close to 0). One can also increase the likelihood of the silhouette being maximized at the correct number of clusters by re-scaling the data using feature weights that are cluster-specific.
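The per-point definition can be sketched in plain Python. The `silhouette` helper and the toy 1-D dataset below are mine, for illustration; in practice one would usually reach for a library implementation such as scikit-learn's `silhouette_score`:

```python
def silhouette(data, labels):
    """Mean silhouette over a 1-D dataset."""
    d = lambda x, y: abs(x - y)   # distance metric; any other (e.g. Manhattan) works
    scores = []
    for i, x in enumerate(data):
        own = [y for j, y in enumerate(data)
               if labels[j] == labels[i] and j != i]
        if not own:               # singleton cluster: s(i) is defined as 0
            scores.append(0.0)
            continue
        a = sum(d(x, y) for y in own) / len(own)            # a(i)
        b = min(                                            # b(i): nearest cluster
            sum(d(x, y) for j, y in enumerate(data) if labels[j] == c) /
            labels.count(c)
            for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

data, labels = [0.0, 1.0, 5.0, 6.0], [0, 0, 1, 1]
print(round(silhouette(data, labels), 3))   # 0.798
```

With two tight, well-separated clusters the mean silhouette (about 0.80) lands in the "strong" range from the rule of thumb above.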
1.1.1 Silhouette Coefficient
To determine the natural number of clusters within a dataset, we can use the silhouette coefficient:
SC(k) = mean of s(i) over the entire dataset, for a clustering into k clusters.
The value of k that maximizes SC(k) is a good candidate for the natural number of clusters.
1.1.2 Drawbacks
With increasing dimensionality of the data, it becomes difficult to achieve such high values because of the curse of dimensionality, as the distances become more similar.
The silhouette score is specialized for measuring cluster quality when the clusters are convex-shaped, and may not perform well if the data clusters have irregular shapes or are of varying sizes.
Computing the silhouette coefficient needs all O(N²) pairwise distances, making this evaluation much more costly than the clustering itself with k-means.
1.1.3 Simplified Silhouette
To save costly computation, we can replace a(i) and b(i) (which average O(N²) pairwise distances) with a′(i) and b′(i), which are as follows:
a′(i) = d(i, μ_{C_I}), the distance from i to the centroid of its own cluster;
b′(i) = min_{J ≠ I} d(i, μ_{C_J}), the distance from i to the centroid of the nearest other cluster.
Here d(x1, x2) is the distance between points x1 and x2, and for a cluster C, μ_C is its centroid.
The number of distances used in simplified silhouette computation is O(Nk), where k is the number of clusters, and the added benefit is that a’(i) is always defined.
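A minimal sketch of this centroid-based variant, reusing the same toy 1-D dataset as before (the `simplified_silhouette` name is mine):

```python
def simplified_silhouette(data, labels):
    """Centroid-based silhouette: O(Nk) distances instead of O(N^2)."""
    # mu_C: centroid (mean) of each cluster
    cents = {c: sum(x for x, l in zip(data, labels) if l == c) /
                sum(1 for l in labels if l == c)
             for c in set(labels)}
    scores = []
    for x, l in zip(data, labels):
        a = abs(x - cents[l])                                # a'(i): own centroid
        b = min(abs(x - cents[c]) for c in cents if c != l)  # b'(i): nearest other centroid
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

print(round(simplified_silhouette([0.0, 1.0, 5.0, 6.0], [0, 0, 1, 1]), 3))  # 0.899
```

Note that the simplified value (≈0.90) differs slightly from the exact silhouette on the same data (≈0.80); the two agree on the ranking of clusterings far more often than on the absolute value.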
1.2 Davies–Bouldin Index
Introduced in 1979, this internal cluster validation metric is also based on compactness and separation. It is calculated as the average similarity of each cluster with the cluster most similar to it, where similarity is defined as the ratio of within-cluster scatter to between-cluster separation. As such, this index ranks well-separated clusters with less dispersion as having a better score.
1.2.1 Some formulae
Si = The ‘within-cluster’ scatter {for cluster i}:
Let Ci be a cluster of data points. Let Xj be a data point assigned to cluster Ci.
For cluster Ci, let its centroid be Ai, and Ti be the number of points in the cluster.
Then the scatter within the cluster, Si, is defined as the average distance between the data points in cluster i and the centroid of the cluster. The formula for Si is:
S_i = ( (1/T_i) · Σ_{j=1}^{T_i} |X_j − A_i|^p )^{1/p}
Usually, the value of p is 2, which makes this a Euclidean distance function. Many other distance metrics can be used, for example with manifolds and higher-dimensional data, where the Euclidean distance may not be the best measure for determining the clusters.
NOTE: The clustering algorithm we use typically has a distance function, usually the Euclidean distance. Many algorithms let us change this distance function; in that case, it is important to match the distance metric used in the formula for Si with the one used in our clustering algorithm.
Mij = Separation between cluster i and cluster j
The centroids of two clusters Ci and Cj are Ai and Aj respectively. If the dataset has n-dimensional points, let a_{k,i} be the kth element of Ai; similarly, the kth element of Aj is a_{k,j}. Then the separation between clusters i and j is given as:
M_{i,j} = ( Σ_{k=1}^{n} |a_{k,i} − a_{k,j}|^p )^{1/p}
Typically the value of p is 2, which makes M_{i,j} the Euclidean distance between the centroids.
Let Ri,j be a measure of how good the clustering scheme is. For this, the separation between the two different clusters i and j should be as large as possible, and the ‘within-cluster’ scatter in cluster i has to be as low as possible.
Hence the Davies–Bouldin index is built from a ratio of Si and Mi,j. The measure R_{i,j} is required to satisfy:
R_{i,j} ≥ 0
R_{i,j} = R_{j,i}
If S_j ≥ S_k and M_{i,j} = M_{i,k}, then R_{i,j} ≥ R_{i,k}.
If we keep cluster i as it is and replace cluster j with a cluster k that is at the same separation from cluster i but has smaller within-cluster scatter, then the clustering scheme is better for the pair (i, k) than for (i, j).
If S_j = S_k and M_{i,j} ≤ M_{i,k}, then R_{i,j} ≥ R_{i,k}, by the same logic: at equal scatter, a smaller separation gives a worse (larger) ratio.
A solution that satisfies these properties is:
R_{i,j} = (S_i + S_j) / M_{i,j}
Then for a cluster i, D_i is defined as the maximum value of R_{i,j} over the other clusters j ≠ i:
D_i = max_{j ≠ i} R_{i,j}
Now we take the average of the D_i over all the clusters. If N is the number of clusters,
DB = (1/N) · Σ_{i=1}^{N} D_i
DB is called the Davies–Bouldin index.
Lower index values indicate a better clustering result.
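The whole S_i / M_ij / R_ij / D_i pipeline fits in a short sketch. This is a minimal pure-Python illustration for 1-D data (the function name and toy clusters are mine; scikit-learn provides an equivalent `davies_bouldin_score` for real use):

```python
def davies_bouldin(clusters, p=2):
    """Davies-Bouldin index for a list of 1-D clusters (lists of floats)."""
    cents = [sum(c) / len(c) for c in clusters]            # centroids A_i
    # S_i = ((1/T_i) * sum |x - A_i|^p)^(1/p): within-cluster scatter
    S = [(sum(abs(x - cents[i]) ** p for x in c) / len(c)) ** (1 / p)
         for i, c in enumerate(clusters)]
    N = len(clusters)
    total = 0.0
    for i in range(N):
        # D_i = max over j != i of R_ij = (S_i + S_j) / M_ij
        total += max((S[i] + S[j]) / abs(cents[i] - cents[j])
                     for j in range(N) if j != i)
    return total / N

print(davies_bouldin([[0.0, 2.0], [10.0, 12.0]]))   # 0.2
```

Here each cluster has scatter 1 and the centroids are 10 apart, so R = (1+1)/10 = 0.2 for both clusters, and the low DB value reflects compact, well-separated clusters.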
1.2.3 Intuition
These conditions constrain the index to be symmetric and non-negative. Because it is defined as a function of the ratio of within-cluster scatter to between-cluster separation, a lower value means that the clustering is better. The index is the similarity between each cluster and its most similar one, averaged over all the clusters, where the similarity is the quantity R_{i,j} defined above. Since we do not want any cluster to be similar to another, the best clustering scheme essentially minimizes the Davies–Bouldin index. Because the index is an average over all the clusters, a good way to decide how many clusters actually exist in the data is to plot it against the number of clusters it is calculated over. The number of clusters for which this value is lowest is a good estimate of the number of clusters the data could ideally be classified into.
1.2.4 Drawback and Soft DBI
DBI provides a measure of how good the clustering scheme is, but it does not guarantee the clusters are useful, meaningful, or optimal for retrieval.
The Davies–Bouldin Index has been extended to support soft clustering, using a fuzzy logic framework. This version incorporates membership values from algorithms like fuzzy c-means, where each element can belong to multiple clusters to varying degrees. As a result, the index evaluates not just compactness and separation, but also the degree of membership, making it more suitable for fuzzy clustering scenarios.
2) External Clustering Validation
External validation measures use external information not present in the data. Since external validation measures know the "true" cluster number in advance, they are mainly used for choosing an optimal clustering algorithm on a specific data set. Internal validation measures, on the other hand, can be used to choose the best clustering algorithm as well as the optimal cluster number without any additional information. An example of an external validation measure is entropy, which evaluates the "purity" of clusters based on the given class labels.
NOTE: In practice, external information such as class labels is often not available in many application scenarios. However, when synthetic data is generated from known benchmarks, it is possible to associate cluster identifiers with the generated records.
In the context of real data sets, these goals can be approximately achieved with the use of class labels when they are available. The major risk with the use of class labels is that these labels are based on application-specific properties of that data set and may not reflect the natural clusters in the underlying data. Nevertheless, such criteria are still preferable to internal methods because they can usually avoid consistent bias in evaluations when used over multiple data sets.
Example Metrics for External Validation:
2.1 Rand Index
The Rand Index (named after William M. Rand) is a measure of the similarity between two data clusterings. It is used to compare a predicted clustering (e.g., from a clustering algorithm like K-Means) to a ground-truth classification (if available), or to another clustering.
One can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm.
2.1.1 Definitions and Intuition
A clustering is essentially a partition of the dataset S = {o_1, o_2, …, o_n}.
Hence a cluster is just a subset of S.
Let X and Y be the clusterings we need to compare. One of these is produced by our clustering algorithm, and the other is typically a ground-truth labeling or reference clustering used for evaluation.
Suppose X = {X_1, X_2, …, X_s} (s subsets)
and Y = {Y_1, Y_2, …, Y_r} (r subsets)
Define the following:
a, the number of pairs of elements in S that are in the same subset in X and in the same subset in Y
b, the number of pairs of elements in S that are in different subsets in X and in different subsets in Y
c, the number of pairs of elements in S that are in the same subset in X and in different subsets in Y
d, the number of pairs of elements in S that are in different subsets in X and in the same subset in Y
The Rand Index measures the proportion of pairwise agreements relative to all possible pairs.
The number of agreements between the two clusterings is clearly a + b, and the number of all possible pairs is a + b + c + d, which also equals C(n, 2), the number of ways to select two points out of the dataset.
The Rand index, R, is:
R = (a + b) / (a + b + c + d) = (a + b) / C(n, 2)
The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same.
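A tiny sketch of the pair-counting view (the helper name and the toy labelings are mine): a pair counts as an agreement when both clusterings put it in the same subset (contributing to a) or both put it in different subsets (contributing to b), and the index divides agreements by C(n, 2):

```python
from itertools import combinations

def rand_index(true, pred):
    """Fraction of point pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(true)), 2))
    agree = 0
    for i, j in pairs:
        same_true = true[i] == true[j]
        same_pred = pred[i] == pred[j]
        if same_true == same_pred:   # agreement: counted in a or in b
            agree += 1
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [0, 0, 0, 1]))  # 0.5
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (labels permuted, same partition)
```

The second call shows that the index depends only on the partition structure, not on the label names.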
2.1.2 Contingency Table
The contingency table is an s × r matrix, where each entry n_ij represents the number of elements that are:
in cluster X_i in the ground truth, and
in cluster Y_j in the predicted clustering.
Hence:

         Y_1   Y_2   …   Y_r  | sums
   X_1   n_11  n_12  …   n_1r | a_1
   X_2   n_21  n_22  …   n_2r | a_2
    ⋮      ⋮     ⋮          ⋮   |  ⋮
   X_s   n_s1  n_s2  …   n_sr | a_s
   sums  b_1   b_2   …   b_r  | n

Let us define a few terms corresponding to the contingency matrix:
n_ij is the number of elements in both cluster X_i from the true labels and cluster Y_j from the predicted labels.
a_i = Σ_j n_ij, the total number of elements in true cluster i.
b_j = Σ_i n_ij, the total number of elements in predicted cluster j.
n = the total number of elements.
2.1.3 Adjusted Rand Index (ARI)
The Adjusted Rand Index using the Permutation Model is:
ARI = [ Σ_ij C(n_ij, 2) − ( Σ_i C(a_i, 2) · Σ_j C(b_j, 2) ) / C(n, 2) ] / [ ½ ( Σ_i C(a_i, 2) + Σ_j C(b_j, 2) ) − ( Σ_i C(a_i, 2) · Σ_j C(b_j, 2) ) / C(n, 2) ]
where C(m, 2) denotes the binomial coefficient "m choose 2".
The Adjusted Rand Index (ARI) corrects the Rand Index (RI) for chance by considering how likely a given level of agreement between two clusterings would be due to random chance (i.e. permutations of labels).
It does this by comparing the observed similarity to what would be expected under a random model.
Traditionally, this correction uses the Permutation Model, where:
The number and sizes of clusters are fixed.
Random clusterings are generated by shuffling elements between those fixed clusters.
However, in practice, this model often doesn’t reflect reality — especially in methods like K-means, where:
The number of clusters is fixed by the user,
However, the sizes of the clusters vary and are learned from the data.
Because of these limitations, newer versions of ARI have been developed that use different random models to better match real clustering behavior.
Though the Rand Index may only yield a value between 0 and +1, the adjusted Rand Index can yield negative values if the index is less than the expected index.
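A minimal sketch of ARI computed straight from the contingency-table counts n_ij, a_i, b_j (the function name and toy labelings are mine):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(true, pred):
    """ARI from the contingency table, with the permutation-model correction."""
    n = len(true)
    nij = Counter(zip(true, pred))   # contingency cells n_ij
    ai = Counter(true)               # row sums a_i
    bj = Counter(pred)               # column sums b_j
    index = sum(comb(v, 2) for v in nij.values())
    row = sum(comb(v, 2) for v in ai.values())
    col = sum(comb(v, 2) for v in bj.values())
    expected = row * col / comb(n, 2)
    max_index = (row + col) / 2
    return (index - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 0, 1]))  # 0.0
```

The second example is instructive: its plain Rand index is 0.5, but after the chance correction the ARI is 0, i.e., that clustering agrees with the ground truth no better than a random assignment with the same cluster sizes would.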
2.2 Normalised Mutual Information (NMI)
2.2.1 Information Content of an Event
The information content, also called the surprisal or self-information, of an event E is a function that increases as the probability p(E) of the event decreases. When p(E) is close to 1, the surprisal of the event is low, but if p(E) is close to 0, the surprisal of the event is high. This relationship is described by the function:
I(E) = log( 1 / p(E) ) = −log p(E)
2.2.2 Entropy of a Discrete Variable
The entropy H of a discrete random variable X, which takes values in the set 𝒳 and is distributed according to p : 𝒳 → [0, 1] with p(x) := P[X = x], is:
H(X) = E[ I(X) ] = −Σ_{x ∈ 𝒳} p(x) · log_b p(x)
Here E is the expected-value operator, and I is the information content of X; I(X) is itself a random variable. In the case of p(x) = 0 for some x ∈ 𝒳, the value of the corresponding summand 0 · log_b(0) is taken to be 0, which is consistent with the limit:
lim_{p → 0⁺} p · log_b p = 0
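A small sketch of the entropy formula, taking b = 2 so the result is in bits (the `entropy` helper name and the example distributions are mine):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; the 0*log(0) summands are skipped (treated as 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))              # 1.0   (a fair coin: maximum surprise)
print(round(entropy([0.25, 0.75]), 3))  # 0.811 (a biased coin: less surprise)
```

The fair coin is the most unpredictable two-outcome variable, so it attains the maximum entropy of 1 bit; any bias lowers the entropy.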
2.2.3 Conditional Entropy
The conditional entropy of two variables X and Y, taking values in the sets 𝒳 and 𝒴 respectively, is:
H(X | Y) = −Σ_{x ∈ 𝒳, y ∈ 𝒴} p(x, y) · log( p(x, y) / p(y) )
This quantity should be understood as the randomness remaining in the random variable X once the random variable Y is known.
2.2.4 What is Mutual Information (MI)?
MI quantifies the "amount of information" obtained about one random variable by observing the other random variable.
Let (X, Y) be a pair of discrete random variables with values from the sets 𝒳 and 𝒴. If their joint distribution is P_(X,Y) and the marginal distributions are P_X and P_Y, the mutual information is defined as:
I(X; Y) = D_KL( P_(X,Y) ‖ P_X ⊗ P_Y )
The mutual information of two jointly discrete random variables X and Y can equivalently be calculated as a double sum:
I(X; Y) = Σ_{y ∈ 𝒴} Σ_{x ∈ 𝒳} P_(X,Y)(x, y) · log( P_(X,Y)(x, y) / ( P_X(x) · P_Y(y) ) )
where P_(X,Y) is the joint probability mass function of X and Y, and P_X and P_Y are the marginal probability mass functions of X and Y respectively.
2.2.5 Normalised Mutual Information (NMI)
Normalized Mutual Information is then calculated by normalizing the mutual information. Let the clustering labels be the random variable C and the ground-truth labels T; one common normalization is:
NMI(C, T) = 2 · I(C; T) / ( H(C) + H(T) )
If the cluster labels and the true labels are independent, then:
I(C; T) = 0 ⇒ NMI = 0
The clustering has no relationship to the actual class structure.
If the clustering exactly matches the ground truth (modulo label permutation), then:
I(C; T) = H(C) = H(T) ⇒ NMI = 1
This is the best possible clustering result.
Think of clustering as a noisy version of the true class labels. If the clustering is imperfect but still informative, the mutual information is partial:
0 < I(C; T) < min( H(C), H(T) )
Interpretation: more noise means lower clustering quality, which is captured by MI or NMI.
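These three regimes can be checked with a short sketch. The helpers below are mine, and the `nmi` function assumes the common normalization NMI = 2·I(C; T) / (H(C) + H(T)); other normalizations (e.g., by the geometric mean of the entropies) exist:

```python
from math import log2
from collections import Counter

def entropy_of(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def mutual_info(c, t):
    """I(C;T) as the double sum over the empirical joint distribution."""
    n = len(c)
    joint = Counter(zip(c, t))
    pc, pt = Counter(c), Counter(t)
    return sum(v / n * log2((v / n) / (pc[a] / n * pt[b] / n))
               for (a, b), v in joint.items())

def nmi(c, t):
    """2*I(C;T) / (H(C)+H(T)): one common normalization of MI."""
    return 2 * mutual_info(c, t) / (entropy_of(c) + entropy_of(t))

print(nmi([0, 0, 1, 1], [0, 0, 1, 1]))            # 1.0   perfect clustering
print(round(nmi([0, 0, 0, 1], [0, 0, 1, 1]), 3))  # 0.344 imperfect but informative
```

The second call merges one point into the wrong cluster: the clustering is still informative about the classes, so the NMI sits strictly between 0 and 1, as described above.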
Resources
DATA CLUSTERING Algorithms and Applications, Edited by Charu C. Aggarwal, Chandan K. Reddy
How to Evaluate the Performance of Clustering Algorithms in Python? (Evaluation of Clustering)
Silhouette (clustering)- Validating Clustering Models- Unsupervised Machine Learning
Cluster Analysis in Python - Silhouette, Calinski Harabasz, and Davies Bouldin for KMeans